ABSTRACT

Many data types arising from data mining applications can
be modeled as bipartite graphs; examples include terms and
documents in a text corpus, customers and purchasing items
in market basket analysis, and reviewers and movies in a
movie recommender system. In this paper, we propose a new
data clustering method based on partitioning the underlying 
bipartite graph. The partition is constructed by minimizing 
a normalized sum of edge weights between unmatched pairs 
of vertices of the bipartite graph. We show that an approxi- 
mate solution to the minimization problem can be obtained 
by computing a partial singular value decomposition (SVD) 
of the associated edge weight matrix of the bipartite graph. 
We point out the connection of our clustering algorithm to 
correspondence analysis used in multivariate analysis. We 
also briefly discuss the issue of assigning data objects to 
multiple clusters. In the experimental results, we apply our 
clustering algorithm to the problem of document clustering 
to illustrate its effectiveness and efficiency. 

Categories and Subject Descriptors 

H.3.3 [Information Search and Retrieval]: Clustering; 
G.1.3 [Numerical Linear Algebra]: Singular value de- 
composition; G.2.2 [Graph Theory]: Graph algorithms 

General Terms 

Algorithms, theory 

Keywords 

document clustering, bipartite graph, graph partitioning, 
spectral relaxation, singular value decomposition, correspon- 
dence analysis 
1. INTRODUCTION

Cluster analysis is an important tool for exploratory data 
mining applications arising from many diverse disciplines. 
Informally, cluster analysis seeks to partition a given data 
set into compact clusters so that data objects within a cluster
are more similar than those in distinct clusters. The literature
on cluster analysis is enormous, including contributions
from many research communities (see [?] for recent surveys
of some classical approaches). Many traditional clustering
algorithms are based on the assumption that the given
dataset consists of covariate information (or attributes) for 
each individual data object, and cluster analysis can be cast 
as a problem of grouping a set of n-dimensional vectors each 
representing a data object in the dataset. A familiar example
is document clustering using the vector space model.
Here each document is represented by an n-dimensional
vector, and each coordinate of the vector corresponds to a
term in a vocabulary of size n. This formulation leads to
the so-called term-document matrix $A = (a_{ij})$ for the
representation of the collection of documents, where $a_{ij}$ is the
so-called term frequency, i.e., the number of times term $i$
occurs in document $j$. In this vector space model, terms and
documents are treated asymmetrically with terms consid- 
ered as the covariates or attributes of documents. It is also 
possible to treat both terms and documents as first-class 
citizens in a symmetric fashion, and consider $a_{ij}$ as the
frequency of co-occurrence of term $i$ and document $j$, as is done,
for example, in probabilistic latent semantic indexing [?].^1
In this paper, we follow this basic principle and propose a 
new approach to model terms and documents as vertices 
in a bipartite graph with edges of the graph indicating the 
co-occurrence of terms and documents. In addition we can 
optionally use edge weights to indicate the frequency of this 
co-occurrence. Cluster analysis for document collections in
this context is based on a very intuitive notion: documents
are grouped by topics. On the one hand, documents in a topic
tend to more heavily use the same subset of terms, which
form a term cluster; on the other hand, a topic usually
is characterized by a subset of terms, and those documents
heavily using those terms tend to be about that particular
topic. It is this interplay of terms and documents which
gives rise to what we call bi-clustering, by which terms and
documents are simultaneously grouped into semantically
coherent clusters.

^1 Our clustering algorithm computes an approximate global
optimal solution, while probabilistic latent semantic indexing
relies on the EM algorithm and therefore might be prone to
local minima even with the help of some annealing process.

Within our bipartite graph model, the clustering problem
can be solved by constructing vertex graph partitions.
Many criteria have been proposed for measuring the quality
of graph partitions of undirected graphs [?]. In this paper,
we show how to adapt those criteria for bipartite graph
partitioning and therefore solve the bi-clustering problem.
A great variety of objective functions have been proposed
for cluster analysis without efficient algorithms for finding
the (approximate) optimal solutions. We will show that our
bipartite graph formulation naturally leads to partial SVD
problems for the underlying edge weight matrix which
admit efficient global optimal solutions. The rest of the paper
is organized as follows: in section 2, we propose a new
criterion for bipartite graph partitioning which tends to produce
balanced clusters. In section 3, we show that our criterion
leads to an optimization problem that can be approximately
solved by computing a partial SVD of the weight matrix of
the bipartite graph. In section 4, we make the connection of
our approximate solution to correspondence analysis used
in multivariate data analysis. In section 5, we briefly
discuss how to deal with clusters with overlaps. In section 6,
we describe experimental results on bi-clustering a dataset
of newsgroup articles. We conclude the paper in section 7
and give pointers to future research.

2. BIPARTITE GRAPH PARTITIONING 

We denote a graph by $G(V,E)$, where $V$ is the vertex
set and $E$ is the edge set of the graph. A graph $G(V,E)$ is
bipartite with two vertex classes $X$ and $Y$ if $V = X \cup Y$ with
$X \cap Y = \emptyset$ and each edge in $E$ has one endpoint in $X$ and
one endpoint in $Y$. We consider a weighted bipartite graph
$G(X,Y,W)$ with $W = (w_{ij})$, where $w_{ij} > 0$ denotes the
weight of the edge between vertices $i$ and $j$. We let $w_{ij} = 0$
if there is no edge between vertices $i$ and $j$. In the context
of document clustering, $X$ represents the set of terms and
$Y$ represents the set of documents, and $w_{ij}$ can be used to
denote the number of times term $i$ occurs in document $j$. A
vertex partition of $G(X,Y,W)$, denoted by $\Pi(A,B)$, is defined
by a partition of the vertex sets $X$ and $Y$, respectively:
$X = A \cup A^c$ and $Y = B \cup B^c$, where for a set $S$, $S^c$ denotes its
complement. By convention, we pair $A$ with $B$, and $A^c$ with
$B^c$. We say that a pair of vertices $x \in X$ and $y \in Y$ is
matched with respect to a partition $\Pi(A,B)$ if there is an
edge between $x$ and $y$, and either $x \in A$ and $y \in B$ or
$x \in A^c$ and $y \in B^c$. For any two subsets of vertices $S \subseteq X$
and $T \subseteq Y$, define

$$W(S,T) = \sum_{i \in S,\; j \in T} w_{ij},$$

i.e., $W(S,T)$ is the sum of the weights of edges with one
endpoint in $S$ and one endpoint in $T$. The quantity $W(S,T)$
can be considered as measuring the association between the
vertex sets $S$ and $T$. In the context of cluster analysis, edge
weight measures the similarity between data objects. To
partition data objects into clusters, we seek a partition of
$G(X,Y,W)$ such that the association (similarity) between
unmatched vertices is as small as possible. One possibility
is to consider for a partition $\Pi(A,B)$ the following quantity

$$\mathrm{cut}(A,B) = W(A,B^c) + W(A^c,B). \qquad (1)$$



Intuitively, choosing $\Pi(A,B)$ to minimize $\mathrm{cut}(A,B)$ will give
rise to a partition that minimizes the sum of all the edge
weights between unmatched vertices. In the context of
document clustering, we try to find two document clusters $B$
and $B^c$ which have few terms in common, and the documents
in $B$ mostly use terms in $A$ and those in $B^c$ use terms
in $A^c$. Unfortunately, choosing a partition based entirely
on $\mathrm{cut}(A,B)$ tends to produce unbalanced clusters, i.e., the
sizes of $A$ and/or $B$ or their complements tend to be small.
Inspired by the work in [5, 14], we propose the following
normalized variant of the edge cut in (1):

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{W(A,Y) + W(X,B)} + \frac{\mathrm{cut}(A^c,B^c)}{W(A^c,Y) + W(X,B^c)}.$$

The intuition behind this criterion is that not only do we want
a partition with small edge cut, but we also want the two
subgraphs formed between the matched vertices to be as
dense as possible. This latter requirement is partially
satisfied by introducing the normalizing denominators in the
above equation.^2 Our bi-clustering problem is now equivalent
to the following optimization problem

$$\min_{\Pi(A,B)} \mathrm{Ncut}(A,B),$$

i.e., finding partitions of the vertex sets $X$ and $Y$ to minimize
the normalized cut of the bipartite graph $G(X,Y,W)$.
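The two criteria of this section can be evaluated directly for a small example. The following is a minimal sketch, assuming a dense NumPy weight matrix; the helper names `w_sum` and `ncut` and the 4-by-4 toy matrix are hypothetical and only illustrate the definitions of $W(S,T)$, $\mathrm{cut}$ and $\mathrm{Ncut}$.

```python
import numpy as np

def w_sum(W, rows, cols):
    """W(S, T): total weight of edges between row set S and column set T."""
    return W[np.ix_(rows, cols)].sum()

def ncut(W, in_A, in_B):
    """Normalized cut of Pi(A, B); in_A / in_B are boolean masks over X / Y."""
    A, Ac = np.flatnonzero(in_A), np.flatnonzero(~in_A)
    B, Bc = np.flatnonzero(in_B), np.flatnonzero(~in_B)
    X, Y = np.arange(W.shape[0]), np.arange(W.shape[1])
    cut = w_sum(W, A, Bc) + w_sum(W, Ac, B)       # equation (1)
    return (cut / (w_sum(W, A, Y) + w_sum(W, X, B))
            + cut / (w_sum(W, Ac, Y) + w_sum(W, X, Bc)))

# Hypothetical 4-term by 4-document count matrix with two topics
# and a single stray co-occurrence between them.
W = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 1., 2., 1.],
              [0., 0., 1., 2.]])
topic_split = np.array([True, True, False, False])
mixed_split = np.array([True, False, True, False])
ncut_topic = ncut(W, topic_split, topic_split)    # A = B = {0, 1}
ncut_mixed = ncut(W, mixed_split, mixed_split)    # A = B = {0, 2}
print(ncut_topic, ncut_mixed)
```

For this toy matrix the topic-aligned partition gives 2/13 and the mixed one gives 10/13, so the criterion prefers the partition that keeps each topic together.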

3. APPROXIMATE SOLUTIONS USING SIN- 
GULAR VECTORS 

Given a bipartite graph $G(X,Y,W)$ and the associated
partition $\Pi(A,B)$, let us reorder the vertices of $X$ and $Y$
so that vertices in $A$ and $B$ are ordered before vertices in $A^c$
and $B^c$, respectively. The weight matrix $W$ can be written
in a block format

$$W = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix}, \qquad (2)$$

i.e., the rows of $W_{11}$ correspond to the vertices in the vertex
set $A$ and the columns of $W_{11}$ correspond to those in
$B$. Therefore $G(A,B,W_{11})$ denotes the weighted bipartite
graph corresponding to the vertex sets $A$ and $B$. For any
m-by-n matrix $H = (h_{ij})$, define

$$s(H) = \sum_{i=1}^{m} \sum_{j=1}^{n} h_{ij},$$

i.e., $s(H)$ is the sum of all the elements of $H$. It is easy to
see from the definition of Ncut,

$$\mathrm{Ncut}(A,B) = \frac{s(W_{12}) + s(W_{21})}{2s(W_{11}) + s(W_{12}) + s(W_{21})} + \frac{s(W_{12}) + s(W_{21})}{2s(W_{22}) + s(W_{12}) + s(W_{21})}.$$



^2 A more natural criterion seems to be

$$\frac{\mathrm{cut}(A,B)}{W(A,B)} + \frac{\mathrm{cut}(A^c,B^c)}{W(A^c,B^c)}.$$

However, it can be shown that it leads to an SVD problem
with the same set of left and right singular vectors.



In order to make connections to SVD problems, we first
consider the case when $W$ is symmetric.^3 It is easy to see
that with $W$ symmetric (denoting $\mathrm{Ncut}(A,A)$ by $\mathrm{Ncut}(A)$),
we have

$$\mathrm{Ncut}(A) = \frac{s(W_{12})}{s(W_{11}) + s(W_{12})} + \frac{s(W_{12})}{s(W_{22}) + s(W_{12})}. \qquad (3)$$



Let $e$ be the vector with all its elements equal to 1. Let $D$ be
the diagonal matrix such that $We = De$. Then $(D - W)e = 0$.
Let $x = (x_i)$ be the vector with

$$x_i = \begin{cases} 1, & i \in A, \\ -1, & i \in A^c. \end{cases}$$

It is easy to verify that

$$s(W_{12}) = x^T (D - W) x / 4.$$

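Define

$$p = \frac{s(W_{11}) + s(W_{12})}{s(W_{11}) + 2s(W_{12}) + s(W_{22})} = \frac{s(W_{11}) + s(W_{12})}{e^T D e}.$$

Then

$$s(W_{11}) + s(W_{12}) = p\, e^T D e, \qquad s(W_{22}) + s(W_{12}) = (1 - p)\, e^T D e,$$

and

$$\mathrm{Ncut}(A) = \frac{x^T (D - W) x}{4p(1 - p)\, e^T D e}.$$

Since $(D - W)e = 0$, for any scalar $s$ we have

$$(se + x)^T (D - W)(se + x) = x^T (D - W) x.$$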

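To cast the above expression for $\mathrm{Ncut}(A)$ in the form of a
Rayleigh quotient, we need to find $s$ such that

$$(se + x)^T D (se + x) = 4p(1 - p)\, e^T D e.$$

Since $x^T D x = e^T D e$ and $x^T D e = (2p - 1)\, e^T D e$, it follows
from the above equation that $s = 1 - 2p$. Now let
$y = (1 - 2p)e + x$; it is easy to see
that $y^T D e = ((1 - 2p)e + x)^T D e = 0$, and

$$y_i = \begin{cases} 2(1 - p) > 0, & i \in A, \\ -2p < 0, & i \in A^c. \end{cases}$$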



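Thus

$$\min_{A} \mathrm{Ncut}(A) = \min \left\{ \frac{y^T (D - W) y}{y^T D y} \;\middle|\; y \in S \right\},$$

where

$$S = \{\, y \mid y^T D e = 0,\; y_i \in \{2(1 - p), -2p\} \,\}.$$

If we drop the constraints $y_i \in \{2(1 - p), -2p\}$ and let the
elements of $y$ take arbitrary continuous values, then the
optimal $y$ can be approximated by the following relaxed
continuous minimization problem,

$$\min \left\{ \frac{y^T (D - W) y}{y^T D y} \;\middle|\; y \neq 0,\; y^T D e = 0 \right\}. \qquad (4)$$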

Notice that it follows from $We = De$ that

$$D^{-1/2} W D^{-1/2} (D^{1/2} e) = D^{1/2} e, \qquad (5)$$

and therefore $D^{1/2}e$ is an eigenvector of $D^{-1/2} W D^{-1/2}$
corresponding to the eigenvalue 1. It is easy to show that all the
eigenvalues of $D^{-1/2} W D^{-1/2}$ have absolute value at most
1 (see the Appendix). Thus the optimal $y$ in (4) can be
computed as $y = D^{-1/2}\hat{y}$, where $\hat{y}$ is the eigenvector of
$D^{-1/2} W D^{-1/2}$ corresponding to the second largest eigenvalue.

^3 A different proof for the symmetric case was first derived
in [14]. However, our derivation is simpler and more
transparent and leads naturally to the SVD problems for the
rectangular case.
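The symmetric-case derivation can be checked numerically on a small graph. The sketch below is an illustration under stated assumptions: the 5-vertex similarity matrix is hypothetical, a dense eigensolver (`numpy.linalg.eigh`) stands in for whatever solver is used in practice, and zero is used as the cut point.

```python
import numpy as np

# Hypothetical symmetric similarity matrix: a tight group {0, 1, 2}
# and a tight group {3, 4}, joined by two weak edges.
W = np.array([[0., 3., 3., 1., 0.],
              [3., 0., 3., 0., 0.],
              [3., 3., 0., 0., 1.],
              [1., 0., 0., 0., 4.],
              [0., 0., 1., 4., 0.]])
d = W.sum(axis=1)                       # diagonal of D, from We = De
D = np.diag(d)

# Identity s(W12) = x^T (D - W) x / 4 for the indicator of A = {0, 1, 2}.
x = np.array([1., 1., 1., -1., -1.])
s_W12 = W[:3, 3:].sum()
identity_gap = abs(s_W12 - x @ (D - W) @ x / 4)

# Relaxed problem (4): eigenvector of D^{-1/2} W D^{-1/2} for the
# second largest eigenvalue, mapped back through y = D^{-1/2} y_hat.
S = W / np.sqrt(np.outer(d, d))
eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
y_hat = eigvecs[:, -2]
y = y_hat / np.sqrt(d)
labels = y >= 0                         # simplest cut point: zero
print(eigvals[-1], labels)
```

The largest eigenvalue is 1 (with eigenvector $D^{1/2}e$), the computed $y$ satisfies $y^T D e = 0$ up to rounding, and its sign pattern separates the two groups; which group receives the positive sign is arbitrary.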

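Now we return to the rectangular case for the weight
matrix $W$, and let $D_X$ and $D_Y$ be diagonal matrices such that

$$We = D_X e, \qquad W^T e = D_Y e. \qquad (6)$$

Consider a partition $\Pi(A,B)$, and define $u = (u_i)$ and $v = (v_i)$ with

$$u_i = \begin{cases} 1, & i \in A, \\ -1, & i \in A^c, \end{cases} \qquad v_i = \begin{cases} 1, & i \in B, \\ -1, & i \in B^c. \end{cases}$$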



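Let $W$ have the block form as in (2), and consider the
augmented symmetric matrix^4

$$\hat{W} = \begin{bmatrix} 0 & W \\ W^T & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & W_{11} & W_{12} \\ 0 & 0 & W_{21} & W_{22} \\ W_{11}^T & W_{21}^T & 0 & 0 \\ W_{12}^T & W_{22}^T & 0 & 0 \end{bmatrix}.$$

If we interchange the second and third block rows and columns
of the above matrix, we obtain

$$\begin{bmatrix} 0 & W_{11} & 0 & W_{12} \\ W_{11}^T & 0 & W_{21}^T & 0 \\ 0 & W_{21} & 0 & W_{22} \\ W_{12}^T & 0 & W_{22}^T & 0 \end{bmatrix} \equiv \begin{bmatrix} \hat{W}_{11} & \hat{W}_{12} \\ \hat{W}_{12}^T & \hat{W}_{22} \end{bmatrix},$$

and the normalized cut can be written as

$$\mathrm{Ncut}(A,B) = \frac{s(\hat{W}_{12})}{s(\hat{W}_{11}) + s(\hat{W}_{12})} + \frac{s(\hat{W}_{12})}{s(\hat{W}_{22}) + s(\hat{W}_{12})},$$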

a form that resembles the symmetric case (3). Define

$$q = \frac{2s(W_{11}) + s(W_{12}) + s(W_{21})}{e^T D_X e + e^T D_Y e}.$$

Then we have

$$\mathrm{Ncut}(A,B) = \frac{-2x^T W y + x^T D_X x + y^T D_Y y}{x^T D_X x + y^T D_Y y} = 1 - \frac{2x^T W y}{x^T D_X x + y^T D_Y y},$$

where $x = (1 - 2q)e + u$ and $y = (1 - 2q)e + v$. It is also easy to
see that

$$x^T D_X e + y^T D_Y e = 0, \qquad x_i,\, y_i \in \{2(1 - q), -2q\}. \qquad (7)$$



Therefore,

$$\min_{\Pi(A,B)} \mathrm{Ncut}(A,B) = 1 - \max_{x \neq 0,\, y \neq 0} \left\{ \frac{2x^T W y}{x^T D_X x + y^T D_Y y} \;\middle|\; x, y \text{ satisfy (7)} \right\}.$$



^4 In [?], the Laplacian of $\hat{W}$ is used for partitioning a
rectangular matrix in the context of designing load-balanced
matrix-vector multiplication algorithms for parallel
computation. However, the eigenvalue problem of the Laplacian of
$\hat{W}$ does not lead to a simpler singular value problem.



Ignoring the discrete constraints on the elements of $x$ and
$y$, we have the following continuous maximization problem,

$$\max_{x \neq 0,\, y \neq 0} \left\{ \frac{2x^T W y}{x^T D_X x + y^T D_Y y} \;\middle|\; x^T D_X e + y^T D_Y e = 0 \right\}. \qquad (8)$$

Without the constraint $x^T D_X e + y^T D_Y e = 0$, the above
problem is equivalent to computing the largest singular triplet
of $D_X^{-1/2} W D_Y^{-1/2}$ (see the Appendix). From (6), we have

$$D_X^{-1/2} W D_Y^{-1/2} (D_Y^{1/2} e) = D_X^{1/2} e, \qquad (D_X^{-1/2} W D_Y^{-1/2})^T (D_X^{1/2} e) = D_Y^{1/2} e,$$

and similarly to the symmetric case, it is easy to show
that all the singular values of $D_X^{-1/2} W D_Y^{-1/2}$ are at most 1.
Therefore, an optimal pair $\{x, y\}$ for (8) can be computed
as $x = D_X^{-1/2}\hat{x}$ and $y = D_Y^{-1/2}\hat{y}$, where $\hat{x}$ and $\hat{y}$ are the
left and right singular vectors of $D_X^{-1/2} W D_Y^{-1/2}$ corresponding
to the second largest singular value (see the Appendix). With the
above discussion, we can now summarize our basic approach for
bipartite graph clustering incorporating a recursive procedure.



Algorithm. Spectral Recursive Embedding (SRE)
Given a weighted bipartite graph $G = (X, Y, E)$ with
its edge weight matrix $W$:

1. Compute $D_X$ and $D_Y$ and form the scaled weight
matrix $\hat{W} = D_X^{-1/2} W D_Y^{-1/2}$.

2. Compute the second largest left and right singular
vectors of $\hat{W}$, $\hat{x}$ and $\hat{y}$.

3. Find cut points $c_x$ and $c_y$ for $x = D_X^{-1/2}\hat{x}$ and
$y = D_Y^{-1/2}\hat{y}$, respectively.

4. Form partitions $A = \{i \mid x_i \geq c_x\}$ and
$A^c = \{i \mid x_i < c_x\}$ for vertex set $X$, and
$B = \{j \mid y_j \geq c_y\}$ and $B^c = \{j \mid y_j < c_y\}$ for
vertex set $Y$.

5. Recursively partition the sub-graphs $G(A,B)$ and
$G(A^c,B^c)$ if necessary.
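Steps 1 to 4 of the algorithm can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes a small dense matrix with no all-zero rows or columns, uses a full SVD from NumPy instead of a partial one, and the function name `sre_bipartition` and the toy matrix are hypothetical.

```python
import numpy as np

def sre_bipartition(W, c_x=0.0, c_y=0.0):
    """One level of SRE on a dense nonnegative matrix W (rows: X, columns: Y).

    Assumes every row and column of W has a positive sum.
    """
    d_x = W.sum(axis=1)                          # diagonal of D_X (W e = D_X e)
    d_y = W.sum(axis=0)                          # diagonal of D_Y (W^T e = D_Y e)
    W_hat = W / np.sqrt(np.outer(d_x, d_y))      # step 1: D_X^{-1/2} W D_Y^{-1/2}
    U, s, Vt = np.linalg.svd(W_hat, full_matrices=False)
    x = U[:, 1] / np.sqrt(d_x)                   # steps 2-3: x = D_X^{-1/2} x_hat
    y = Vt[1, :] / np.sqrt(d_y)                  #            y = D_Y^{-1/2} y_hat
    return x >= c_x, y >= c_y, s                 # step 4: masks for A and B

# Hypothetical term-document counts with two topics.
W = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 1., 2., 1.],
              [0., 0., 1., 2.]])
in_A, in_B, singular_values = sre_bipartition(W)
print(in_A, in_B, singular_values[:2])
```

Step 5 would call the same function on the sub-matrices `W[in_A][:, in_B]` and `W[~in_A][:, ~in_B]`. The sign of a singular vector pair is arbitrary, so which side is labeled $A$ can flip between runs or libraries, but the left and right vectors flip together, so $A$ stays paired with $B$.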



Two basic strategies can be used for selecting the cut
points $c_x$ and $c_y$. The simplest strategy is to set $c_x = 0$
and $c_y = 0$. Another more computing-intensive approach
is to base the selection on Ncut: check $N$ equally spaced
splitting points of $x$ and $y$, respectively, and find the cut points
$c_x$ and $c_y$ with the smallest Ncut [14].
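The Ncut-based strategy can be sketched as a grid search. The sketch below makes one assumption the text leaves open: it searches all pairs of candidate cut points jointly, with the candidates equally spaced strictly between the smallest and largest entry of each vector. The function names, the toy matrix and the embedding coordinates are hypothetical.

```python
import numpy as np

def ncut_masks(W, in_A, in_B):
    """Ncut for boolean masks in_A / in_B; inf if one side carries no weight."""
    cut = W[in_A][:, ~in_B].sum() + W[~in_A][:, in_B].sum()
    vol = W[in_A].sum() + W[:, in_B].sum()           # W(A, Y) + W(X, B)
    vol_c = W[~in_A].sum() + W[:, ~in_B].sum()       # W(A^c, Y) + W(X, B^c)
    if vol == 0 or vol_c == 0:
        return np.inf
    return cut / vol + cut / vol_c

def best_cut_points(W, x, y, n_points=10):
    """Try n_points equally spaced interior cut points per vector; keep the best pair."""
    cand_x = np.linspace(x.min(), x.max(), n_points + 2)[1:-1]
    cand_y = np.linspace(y.min(), y.max(), n_points + 2)[1:-1]
    best = (np.inf, None, None)
    for c_x in cand_x:
        for c_y in cand_y:
            score = ncut_masks(W, x >= c_x, y >= c_y)
            if score < best[0]:
                best = (score, c_x, c_y)
    return best

W = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 1., 2., 1.],
              [0., 0., 1., 2.]])
x = np.array([0.9, 0.7, -0.5, -0.9])   # hypothetical coordinates for the rows
y = np.array([0.9, 0.5, -0.7, -0.9])   # hypothetical coordinates for the columns
score, c_x, c_y = best_cut_points(W, x, y)
print(score, c_x, c_y)
```

The joint search costs on the order of $N^2$ Ncut evaluations; searching each vector separately, or updating the sums incrementally along the sorted entries, would be cheaper variants.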

Computational complexity. The major computational 
cost of SRE is Step 2 for computing the left and right singu- 
lar vectors which can be obtained either by power method 
or more robustly by Lanczos bidiagonalization process [^, 
Chapter 9]. Lanczos method is an iterative process for com- 
puting partial SVDs in which each iterative step involves 
the computation of two matrix-vector multiplications $\hat{W}u$
and $\hat{W}^T v$ for some vectors $u$ and $v$. The computational cost
of these is roughly proportional to $\mathrm{nnz}(\hat{W})$, the number of
nonzero elements of $\hat{W}$. The total computational cost of
SRE is $O(c_{sre}\, k_{svd}\, \mathrm{nnz}(\hat{W}))$, where $c_{sre}$ is the level of
recursion and $k_{svd}$ is the number of Lanczos iteration steps.
In general, $k_{svd}$ depends on the singular value gaps of $\hat{W}$.
Also notice that $\mathrm{nnz}(\hat{W}) = \bar{n} n$, where $\bar{n}$ is the average
number of terms per document and $n$ is the total number
of documents. Therefore, the total cost of SRE is in general
linear in the number of documents to be clustered.
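The role of the two matrix-vector products can be seen in a power-method sketch for Step 2. It is a simplification under stated assumptions: it uses plain power iteration rather than Lanczos bidiagonalization, it removes the known leading singular pair ($D_X^{1/2}e$, $D_Y^{1/2}e$, normalized) by explicit deflation, and it stores the matrix densely; with sparse storage each product would cost on the order of $\mathrm{nnz}(\hat{W})$. The function name and toy matrix are hypothetical.

```python
import numpy as np

def second_singular_pair(W, n_iter=200, seed=0):
    """Power iteration for the second singular triplet of D_X^{-1/2} W D_Y^{-1/2}."""
    d_x, d_y = W.sum(axis=1), W.sum(axis=0)
    W_hat = W / np.sqrt(np.outer(d_x, d_y))
    u1 = np.sqrt(d_x / d_x.sum())        # leading left singular vector (unit norm)
    v1 = np.sqrt(d_y / d_y.sum())        # leading right singular vector (unit norm)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    for _ in range(n_iter):
        v -= (v1 @ v) * v1               # deflate the leading right vector
        u = W_hat @ v                    # first matrix-vector product
        u -= (u1 @ u) * u1               # deflate the leading left vector
        u /= np.linalg.norm(u)
        v = W_hat.T @ u                  # second matrix-vector product
        sigma = np.linalg.norm(v)
        v /= sigma
    return sigma, u, v

W = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 1., 2., 1.],
              [0., 0., 1., 2.]])
sigma, u, v = second_singular_pair(W)
print(sigma)
```

The number of iterations needed depends on the gap between the second and third singular values, which mirrors the remark above that the iteration count depends on the singular value gaps.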

4. CONNECTIONS TO CORRESPONDENCE 
ANALYSIS 

In its basic form, correspondence analysis is applied to
an m-by-n two-way table of counts $W$ [?]. Let
$w = s(W)$, the sum of all the elements of $W$, and let $D_X$ and $D_Y$
be the diagonal matrices defined in section 3. Correspondence
analysis seeks to compute the largest singular triplets of the
matrix $Z = (z_{ij}) \in \mathcal{R}^{m \times n}$ with

$$z_{ij} = \frac{w_{ij}/w - (D_X(i,i)/w)(D_Y(j,j)/w)}{\sqrt{(D_X(i,i)/w)(D_Y(j,j)/w)}}.$$



The matrix $Z$ can be considered as the correlation matrix
of two group indicator matrices for the original $W$ [?]. We
now show that the SVD of $Z$ is closely related to the SVD
of $\hat{W} = D_X^{-1/2} W D_Y^{-1/2}$. In fact, in section 3, we showed
that $D_X^{1/2}e$ and $D_Y^{1/2}e$ are the left and right singular vectors
of $\hat{W}$ corresponding to the singular value one, and it is also
easy to show that all the singular values of $\hat{W}$ are at most
1. Therefore, the rest of the singular values and singular
vectors of $\hat{W}$ can be found by computing the SVD of the
following rank-one modification of $\hat{W}$

$$D_X^{-1/2} W D_Y^{-1/2} - \frac{D_X^{1/2} e\, e^T D_Y^{1/2}}{\|D_X^{1/2} e\|_2\, \|D_Y^{1/2} e\|_2},$$

which has $(i,j)$ element

$$\frac{w_{ij}}{\sqrt{D_X(i,i) D_Y(j,j)}} - \frac{\sqrt{D_X(i,i) D_Y(j,j)}}{w},$$



and is a constant multiple of the (i, j) element of Z. There- 
fore, normalized-cut based cluster analysis and correspon- 
dence analysis arrive at the same SVD problems even though 
they start with completely different principles. It is worth- 
while to explore more deeply the interplay between these two 
different points of views and approaches, for example, using 
the statistical analysis of correspondence analysis to provide 
better strategy for selecting cut points and estimating the 
number of clusters. 
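The relation between the two SVD problems can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a hypothetical 4-by-4 table of counts, builds $Z$ from the definition of $z_{ij}$ in this section, and estimates the proportionality constant by least squares rather than assuming its value.

```python
import numpy as np

# Hypothetical two-way table of counts.
W = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 1., 2., 1.],
              [0., 0., 1., 2.]])
w = W.sum()
d_x, d_y = W.sum(axis=1), W.sum(axis=0)

# Correspondence analysis matrix Z.
r, c = d_x / w, d_y / w
Z = (W / w - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Rank-one modification of the scaled matrix; the product of the two
# 2-norms in the denominator equals w because e^T D_X e = e^T D_Y e = w.
W_hat = W / np.sqrt(np.outer(d_x, d_y))
M = W_hat - np.outer(np.sqrt(d_x), np.sqrt(d_y)) / w

scale = (M * Z).sum() / (Z * Z).sum()          # least-squares constant with M = scale * Z
sv_hat = np.linalg.svd(W_hat, compute_uv=False)
sv_mod = np.linalg.svd(M, compute_uv=False)
print(scale, sv_hat, sv_mod)
```

The singular values of the modified matrix are those of the scaled matrix with the leading value 1 replaced by 0, which is why the second singular triplet of $\hat{W}$ and the leading triplet of $Z$ carry the same information.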

5. PARTITIONS WITH OVERLAPS 

So far in our discussion, we have only looked at hard clus- 
tering, i.e., a data object belongs to one and only one cluster. 
In many situations, especially when there is much overlap
among the clusters, it is more advantageous to allow data 
objects to belong to different clusters. For example, in doc- 
ument clustering, certain groups of words can be shared by 
two clusters. Is it possible to model this overlap using our 
bipartite graph model and also find efficient approximate 
solutions? The answer seems to be yes, but our results at 
this point are rather preliminary and we will only illustrate 
the possibilities. Our basic idea is that when computing
$\mathrm{Ncut}(A,B)$, we should disregard the contributions of the
set of vertices that is in the overlap. More specifically, let
$X = A \cup O_X \cup \bar{A}$ and $Y = B \cup O_Y \cup \bar{B}$, where $O_X$ denotes the
overlap between the vertex subsets $A \cup O_X$ and $\bar{A} \cup O_X$, and
$O_Y$ the overlap between $B \cup O_Y$ and $\bar{B} \cup O_Y$; we compute

$$\mathrm{Ncut}(A,B,\bar{A},\bar{B}) = \frac{\mathrm{cut}(A,B)}{W(A,Y) + W(X,B)} + \frac{\mathrm{cut}(\bar{A},\bar{B})}{W(\bar{A},Y) + W(X,\bar{B})}.$$

Figure 1: Sparsity patterns of a test matrix before
clustering (left) and after clustering (right)



However, we can make $\mathrm{Ncut}(A,B,\bar{A},\bar{B})$ smaller simply by
putting more vertices in the overlap. Therefore, we need
to balance these two competing quantities, the size of the
overlap and the modified normalized cut $\mathrm{Ncut}(A,B,\bar{A},\bar{B})$,
by minimizing

$$\mathrm{Ncut}(A,B,\bar{A},\bar{B}) + \alpha(|O_X| + |O_Y|),$$

where $\alpha$ is a regularization parameter. How to find an
efficient method for computing the (approximate) optimal
solution to the above minimization problem still needs to be
investigated. We close this section by presenting an
illustrative example showing that in some situations, the
singular vectors already automatically separate the overlap sets
while giving the coordinates for carrying out clustering.

Example 1. We construct a sparse m-by-n rectangular
matrix

$$W = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix}$$

so that $W_{11}$ and $W_{22}$ are relatively denser than $W_{12}$ and
$W_{21}$. We also add some dense rows and columns to the
matrix $W$ to represent row and column overlaps. The left panel
of Figure 1 shows the sparsity pattern of $\widetilde{W}$, a matrix
obtained by randomly permuting the rows and columns of $W$.
We then compute the second largest left and right singular
vectors of $D_X^{-1/2} \widetilde{W} D_Y^{-1/2}$, say $\hat{x}$ and $\hat{y}$, then sort the rows
and columns of $\widetilde{W}$ according to the values of the entries in
$D_X^{-1/2}\hat{x}$ and $D_Y^{-1/2}\hat{y}$, respectively. The sparsity pattern of
this permuted $\widetilde{W}$ is shown on the right panel of Figure 1. As
can be seen, the singular vectors not only do the job of
clustering but at the same time also concentrate the dense
rows and columns at the boundary of the two clusters.

6. EXPERIMENTS 

In this section we present our experimental results on
clustering a dataset of newsgroup articles submitted to 20
newsgroups.^5 This dataset contains about 20,000 articles (email
messages) evenly divided among the 20 newsgroups. We list
the names of the newsgroups together with the associated
group labels (the labels will be used in the sequel to identify
the newsgroups).

NG1: alt.atheism
NG2: comp.graphics
NG3: comp.os.ms-windows.misc
NG4: comp.sys.ibm.pc.hardware
NG5: comp.sys.mac.hardware
NG6: comp.windows.x
NG7: misc.forsale
NG8: rec.autos
NG9: rec.motorcycles
NG10: rec.sport.baseball
NG11: rec.sport.hockey
NG12: sci.crypt
NG13: sci.electronics
NG14: sci.med
NG15: sci.space
NG16: soc.religion.christian
NG17: talk.politics.guns
NG18: talk.politics.mideast
NG19: talk.politics.misc
NG20: talk.religion.misc

We used the bow toolkit to construct the term-document
matrix for this dataset; specifically, we use the tokenization
option so that the UseNet headers are stripped, and we also
applied stemming [?]. Some of the newsgroups have large
overlaps, for example, the five comp.* newsgroups about
computers. In fact, several articles are posted to multiple
newsgroups. Before we apply clustering algorithms to
the dataset, several preprocessing steps need to be considered.
Two standard steps are weighting and feature selection.
For weighting, we considered a variant of the tf.idf weighting
scheme, $tf \cdot \log_2(n/df)$, where $tf$ is the term frequency
and $df$ is the document frequency, and several other variations
listed in [?]. For feature selection, we looked at three
approaches: 1) deleting terms that occur less than a certain
number of times in the dataset; 2) deleting terms that occur
in less than a certain number of documents in the dataset;
3) selecting terms according to mutual information of terms
and documents defined as

$$I(y) = \sum_{x} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)},$$

where $y$ represents a term and $x$ a document [?]. In general
we found that the traditional tf.idf based weighting
schemes do not improve performance for SRE. One possible
explanation comes from the connection with correspondence
analysis: the raw frequencies are samples of co-occurrence
probabilities, and the pre- and post-multiplication by
$D_X^{-1/2}$ and $D_Y^{-1/2}$ in $D_X^{-1/2} W D_Y^{-1/2}$ automatically take
weighting into account. We did, however, find that
trimming the raw frequencies can sometimes improve
performance for SRE, especially for the anomalous cases where
some words can occur in certain documents an unusual number
of times, skewing the clustering process.
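The mutual information score used for feature selection can be computed directly from a matrix of counts. The sketch below rests on assumptions that the text does not spell out: the joint probabilities $p(x,y)$ are estimated as normalized counts, the natural logarithm is used, and zero counts contribute zero to the sum. The function name and the 3-term by 2-document matrix are hypothetical.

```python
import numpy as np

def term_mutual_information(W):
    """I(y) for each term y (row of W), summing over documents x (columns)."""
    P = W / W.sum()                           # estimated joint probabilities p(x, y)
    p_term = P.sum(axis=1, keepdims=True)     # p(y)
    p_doc = P.sum(axis=0, keepdims=True)      # p(x)
    ratio = np.ones_like(P)                   # log(1) = 0 where p(x, y) = 0
    nz = P > 0
    ratio[nz] = P[nz] / (p_term * p_doc)[nz]
    return (P * np.log(ratio)).sum(axis=1)

# Terms 0 and 1 each occur in a single document; term 2 is spread evenly.
W = np.array([[4., 0.],
              [0., 4.],
              [2., 2.]])
scores = term_mutual_information(W)
print(scores)
```

Terms that occur evenly across documents receive a score near zero and would be dropped first, while terms concentrated in a few documents score higher.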



^5 The newsgroup dataset together with the bow toolkit for
processing it can be downloaded from
http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html.



Table 1: Comparison of spectral embedding (SRE), PDDP, and K-means (NG1/NG2)

Mixture   SRE               PDDP                          K-means
50/50     92.12 ± 3.52%     91.90 ± 3.19% (53, 10, 37)    76.93 ± 14.42% (82, 2, 10)
50/100    90.57 ± 3.11%     86.11 ± 3.94% (86, 5, 9)      76.74 ± 14.01% (80, 2, 18)
50/150    88.04 ± 3.90%     78.60 ± 5.03% (98, 0, 2)      68.80 ± 13.55% (88, 0, 12)
50/200    82.77 ± 5.24%     70.43 ± 6.04% (97, 0, 3)      69.22 ± 12.34% (83, 1, 16)



Table 2: Comparison of spectral embedding (SRE), PDDP, and K-means (NG10/NG11)

Mixture   SRE               PDDP                          K-means
50/50     74.56 ± 8.93%     73.40 ± 10.07% (56, 6, 38)    61.61 ± 8.77% (86, 0, 14)
50/100    67.13 ± 7.17%     67.10 ± 10.20% (52, 1, 47)    64.40 ± 9.37% (59, 1, 40)
50/150    58.30 ± 5.99%     58.72 ± 7.48% (52, 1, 47)     62.53 ± 8.20% (36, 1, 63)
50/200    57.55 ± 5.69%     56.63 ± 4.84% (58, 1, 41)     60.82 ± 7.54% (39, 2, 59)



For the purpose of comparison, we consider two other
clustering methods: 1) the K-means method [?]; 2) the principal
direction divisive partitioning (PDDP) method [3]. The K-means method
is a widely used cluster analysis tool. The variant we used 
employs the Euclidean distance when comparing the dissim- 
ilarity between two documents. When applying K-means, 
we normalize the length of each document so that it has 
Euclidean length one. In essence, we use the cosine of the 
angle between two document vectors when measuring their 
similarity. We have also tried K-means without document
length normalization; the results are far worse, and therefore
we do not report the corresponding results. Since K-means
method is an iterative method, we need to specify a stopping 
criterion. For the variant we used, we compare the centroids 
between two consecutive iterations, and stop when the dif- 
ference is smaller than a pre-defined tolerance. 

PDDP is another clustering method that utilizes singular
vectors. It is based on the idea of principal component
analysis and has been shown to outperform several standard
clustering methods such as the hierarchical agglomerative
algorithm [?]. First, each document is considered as a
multivariate data point. The documents are normalized to have
unit Euclidean length and then centered, i.e., let $W$ be the
term-document matrix, and $w$ be the average of the columns
of $W$. Compute the largest singular value triplet $\{u, \sigma, v\}$ of
$W - we^T$. Then split the set of documents based on their
values in the vector $v = (v_i)$: one simple scheme is to let
those with positive $v_i$ go into one cluster and those with
nonpositive $v_i$ into another cluster. Then the whole
process is repeated on the term-document matrices of the two
clusters, respectively. Although both our clustering method
SRE and PDDP make use of the singular vectors of some 
versions of the term-document matrices, they are derived 
from fundamentally different principles. PDDP is a
feature-based clustering method, projecting all the data points to
the one-dimensional subspace spanned by the first principal
axis; SRE is a similarity-based clustering method in which two
co-occurring variables (terms and documents in the context of
document clustering) are simultaneously clustered. Unlike
SRE, PDDP does not have a well-defined objective function 
for minimization. It only partitions the columns of the term- 
document matrices while SRE partitions both of its rows and 
columns. This will have significant impact on the computa- 
tional costs. PDDP, however, has an advantage that it can 



be applied to datasets with both positive and negative values
while SRE can only be applied to datasets with nonnegative 
data values. 
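For comparison with SRE, a single PDDP split as described above can be sketched as follows. This is a reading of the description in this section, not the reference PDDP code: documents are the columns of the matrix, each column is scaled to unit Euclidean length, the column average is subtracted, and documents are split by the sign of the leading right singular vector. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def pddp_split(W):
    """One PDDP split of the documents (columns) of a term-document matrix W."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)     # unit-length documents
    w_bar = Wn.mean(axis=1, keepdims=True)                # average of the columns
    U, s, Vt = np.linalg.svd(Wn - w_bar, full_matrices=False)   # SVD of W - w e^T
    v = Vt[0]                                             # leading right singular vector
    return v > 0                                          # positive entries form one cluster

W = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 1., 2., 1.],
              [0., 0., 1., 2.]])
doc_labels = pddp_split(W)
print(doc_labels)
```

Unlike the SRE sketch, this returns labels for the documents only; the terms are not partitioned.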

Example 2. In this example, we examine binary clustering
with uneven clusters. We consider three pairs of newsgroups:
newsgroups 1 and 2 are well-separated, 10 and 11
are less well-separated, and 18 and 19 have a lot of overlap.
We used document frequency as the feature selection criterion
and deleted words that occur in fewer than 5 documents
in each dataset we used. For both K-means and PDDP we
apply tf.idf weighting together with document length
normalization so that each document vector has Euclidean
norm one. For SRE we trim the raw frequency so that the
maximum is 10. For each newsgroup pair, we select four
types of mixture of articles from each newsgroup: x/y
indicates that x articles are from the first group and y articles
are from the second group. The results are listed in Table 1
for groups 1 and 2, Table 2 for groups 10 and 11, and Table 3
for groups 18 and 19. We list the means and standard
deviations for 100 random samples. For PDDP and K-means we
also include a triplet of numbers which indicates in how many
of the 100 samples SRE performs better (the first number),
the same (the second number), and worse (the third number)
than the corresponding method (PDDP or K-means).
We should emphasize that the K-means method can only find
a local minimum, and the results depend on initial values and
stopping criteria. This is also reflected by the large standard
deviations associated with the K-means method. From the three
tests we can conclude that both SRE and PDDP outperform
the K-means method. The performance of SRE and PDDP is
similar in balanced mixtures, but SRE is superior to PDDP
in skewed mixtures.

Example 3. In this example, we consider an easy
multi-cluster case: we examine the five newsgroups 2, 9, 10, 15, 18,
which were also considered in [?]. We sample 100 articles from each
newsgroup, and we use mutual information for feature selection.
We use minimum normalized cut as the cut point for each level
of the recursion. For one sample, Table 4 gives the confusion
matrix. The accuracy for this sample is 88.2%. We also
tested two other samples with accuracy 85.4% and 81.2%,
which compare favorably with those obtained for three samples
with accuracy 59%, 58% and 53% reported in [?]. In
the following we also list the top few words for each
cluster computed by mutual information.



Table 3: Comparison of spectral embedding (SRE), PDDP, and K-means (NG18/NG19)

Mixture   SRE               PDDP                          K-means
50/50     73.66 ± 10.53%    69.52 ± 12.83% (65, 12, 32)   62.25 ± 9.94% (82, 1, 17)
50/100    67.23 ± 7.84%     67.84 ± 7.30% (46, 5, 49)     60.91 ± 7.92% (65, 13, 32)
50/150    65.83 ± 12.79%    60.37 ± 9.85% (53, 3, 44)     63.32 ± 8.26% (58, 3, 39)
50/200    61.23 ± 9.88%     60.76 ± 5.55% (40, 1, 59)     64.50 ± 7.58% (34, 0, 66)



Table 4: Confusion matrix for newsgroups {2, 9, 10, 15, 18}

            mideast   graphics   space   baseball   motorcycles
cluster 1   87        0          0       2          0
cluster 2   7         90         7       6          7
cluster 3   3         9          84      1          1
cluster 4   0         0          1       88         0
cluster 5   3         1          8       3          92
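The accuracy quoted for this sample can be reproduced from the confusion matrix. The snippet below assumes that the cells left blank in Table 4 are zero counts, which is consistent with every column summing to the 100 sampled articles per newsgroup, and that accuracy is the fraction of articles on the diagonal.

```python
# Rows: clusters 1-5; columns: mideast, graphics, space, baseball, motorcycles.
# Blank cells of Table 4 are taken to be zero (an assumption).
confusion = [
    [87, 0, 0, 2, 0],
    [7, 90, 7, 6, 7],
    [3, 9, 84, 1, 1],
    [0, 0, 1, 88, 0],
    [3, 1, 8, 3, 92],
]
correct = sum(confusion[k][k] for k in range(5))
total = sum(sum(row) for row in confusion)
column_sums = [sum(row[j] for row in confusion) for j in range(5)]
accuracy = correct / total
print(correct, total, accuracy)
```

With these counts 441 of 500 articles fall on the diagonal, which matches the 88.2% stated in the text.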



Cluster 1 : 

armenian Israel arab Palestinian peopl jew isra 
iran muslim kill turkis war greek iraqi adl call 

Cluster 2: 

imag file bit green gif mail graphic colour 
group version comput jpeg blue xv ftp ac uk list 

Cluster 3: 

univers space nasa theori system mission henri 
moon cost sky launch orbit shuttl physic work 

Cluster 4: 

clutch year game gant player team hirschbeck 
basebal won hi lost ball defens base run win 

Cluster 5: 

bike dog lock ride don wave drive black 
articl write apr motorcycl ca turn dod insur 

7. CONCLUSIONS AND FUTURE WORK

In this paper, we formulate a class of clustering prob- 
lems as bipartite graph partitioning problems, and we show 
that efficient optimal solutions can be found by comput- 
ing the partial singular value decomposition of some scaled 
edge weight matrices. However, we have also shown that 
there still remain many challenging problems. One area that 
needs further investigation is the selection of cut points and 
number of clusters using multiple left and right singular
vectors, and the possibility of adding local refinements to
improve clustering quality. Another area is to find efficient
algorithms for handling overlapping clusters. Finally, the 
treatment of missing data under our bipartite graph model 
especially when we apply our spectral clustering methods to 
the problem of data analysis of recommender systems also 
deserves further investigation. 
APPENDIX

A. SOME PROOFS 

In this appendix we prove three results: 1) All the
eigenvalues of $D^{-1/2} W D^{-1/2}$ have absolute value at most 1.
Equivalently, we need to prove that the eigenvalues of the
generalized eigenvalue problem $Wx = \lambda Dx$ have absolute value
at most 1. In fact, let $x = (x_i)_{i=1}^{n}$ and let $i$ be such that
$|x_i| = \max_j |x_j|$. Then, with $d_i = D(i,i)$, it follows from

$$\lambda d_i x_i = \sum_{j=1}^{n} w_{ij} x_j$$

that

$$|\lambda| \leq \sum_{j=1}^{n} w_{ij} / d_i = 1.$$

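2) We prove that

$$\sigma_{\max}(D_X^{-1/2} W D_Y^{-1/2}) = \max_{x \neq 0,\, y \neq 0} \frac{2x^T W y}{x^T D_X x + y^T D_Y y}.$$

Let $\hat{x} = D_X^{1/2} x$ and $\hat{y} = D_Y^{1/2} y$; then

$$\frac{2x^T W y}{x^T D_X x + y^T D_Y y} = \frac{2\hat{x}^T D_X^{-1/2} W D_Y^{-1/2} \hat{y}}{\hat{x}^T \hat{x} + \hat{y}^T \hat{y}}. \qquad (9)$$

Let $D_X^{-1/2} W D_Y^{-1/2} = U \Sigma V^T$ be its SVD with

$$U = [u_1, \ldots, u_m], \qquad V = [v_1, \ldots, v_n]$$

and

$$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_{\min\{m,n\}}), \qquad \sigma_1 = \sigma_{\max}(D_X^{-1/2} W D_Y^{-1/2}).$$

Then we can expand $\hat{x}$ and $\hat{y}$ as

$$\hat{x} = \sum_i \hat{x}_i u_i, \qquad \hat{y} = \sum_i \hat{y}_i v_i, \qquad (10)$$

and (9) becomes

$$\frac{2\sum_i \sigma_i \hat{x}_i \hat{y}_i}{\sum_i \hat{x}_i^2 + \sum_i \hat{y}_i^2} \leq \frac{\sum_i \sigma_i (\hat{x}_i^2 + \hat{y}_i^2)}{\sum_i \hat{x}_i^2 + \sum_i \hat{y}_i^2} \leq \sigma_1.$$

Taking $\hat{x}_1 = 1$ and $\hat{y}_1 = 1$, with all other coefficients zero,
achieves the maximum.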
3) Now we consider the constraint

$$x^T D_X e + y^T D_Y e = 0,$$

which is equivalent to $\hat{x}_1 + \hat{y}_1 = 0$ using the expansions
in (10). We can always scale the vectors $\hat{x}$ and $\hat{y}$ without
changing the maximum so that $\hat{x}_1 \geq 0$ and $\hat{y}_1 \geq 0$. Hence
$\hat{x}_1 + \hat{y}_1 = 0$ implies that $\hat{x}_1 = \hat{y}_1 = 0$. It is then easy to see
that

$$\sigma_2 = \max_{x \neq 0,\, y \neq 0} \left\{ \frac{2x^T W y}{x^T D_X x + y^T D_Y y} \;\middle|\; x^T D_X e + y^T D_Y e = 0 \right\},$$

and the maximum is achieved by the second largest left and
right singular vectors of $D_X^{-1/2} W D_Y^{-1/2}$.
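Result 2) can be probed numerically on random data. The sketch below is a sanity check under stated assumptions, not a proof: it draws a hypothetical dense nonnegative matrix, evaluates the quotient at random vector pairs, and compares against the largest singular value of the scaled matrix, which equals 1 and is attained at $x = e$, $y = e$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((5, 7))                      # hypothetical dense nonnegative weights
d_x, d_y = W.sum(axis=1), W.sum(axis=0)     # diagonals of D_X and D_Y

def quotient(x, y):
    """2 x^T W y / (x^T D_X x + y^T D_Y y)."""
    return 2 * (x @ W @ y) / (x @ (d_x * x) + y @ (d_y * y))

sigma = np.linalg.svd(W / np.sqrt(np.outer(d_x, d_y)), compute_uv=False)
samples = [quotient(rng.standard_normal(5), rng.standard_normal(7))
           for _ in range(1000)]
at_ones = quotient(np.ones(5), np.ones(7))  # x = e, y = e
print(sigma[0], max(samples), at_ones)
```

No sampled pair exceeds the bound, and the all-ones pair attains it, in line with the leading singular vectors $D_X^{1/2}e$ and $D_Y^{1/2}e$ identified in section 3.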



